
feat: add deterministic seed_random management command for synthetic …#3041

Open
abdihakim92x1 wants to merge 40 commits into main from task/CDD-3154-seed-random-data

Conversation

@abdihakim92x1
Contributor

…dev data

Description

This PR includes the following:

  • Introduces python manage.py seed_random
  • Supports dataset selection: cms | metrics | both
  • Supports scale options: small | medium | large
  • Deterministic seeding via --seed
  • Optional --truncate-first for safe reset of metrics data
  • Reuses build_cms_site for CMS baseline
  • Bulk inserts CoreTimeSeries and APITimeSeries for performance
  • Designed for personal dev environment seeding
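
The deterministic behavior promised by --seed can be illustrated with a simplified sketch (this is not the command's actual internals; the function and field names here are hypothetical):

```python
import random

def generate_rows(seed: int, n_rows: int) -> list[dict]:
    """Generate reproducible synthetic metric values from a fixed seed.

    Illustrates the idea behind --seed: the same seed always yields the
    same rows, so two dev environments seeded identically are comparable.
    """
    rng = random.Random(seed)  # isolated RNG instance; avoids mutating global random state
    return [
        {"day": day, "metric_value": rng.randint(0, 1000)}
        for day in range(n_rows)
    ]

# The same seed produces identical output on every run.
assert generate_rows(seed=42, n_rows=5) == generate_rows(seed=42, n_rows=5)
```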

Partly addresses #CDD-3154


Type of change

Please select the options that are relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Tech debt item (this is focused solely on addressing any relevant technical debt)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests at the right levels to prove my change is effective
  • I have added screenshots or screen grabs where appropriate
  • I have added docstrings in the correct style (google)

@abdihakim92x1 abdihakim92x1 self-assigned this Mar 4, 2026
@abdihakim92x1 abdihakim92x1 requested a review from a team as a code owner March 4, 2026 00:31

@jeanpierrefouche-ukhsa jeanpierrefouche-ukhsa left a comment


Good code! Please see my comments.

@jrdh jrdh force-pushed the task/CDD-3154-seed-random-data branch from 50b07fa to a70c824 Compare March 6, 2026 09:18
Contributor

@jrdh jrdh left a comment


There may be some other things I could pick up on in here but I'm going to pause going through it because one of the ACs is unmet and is a bigger problem - specifically "The data is generated and uploaded to the s3 bucket in the target dev env, not directly inserted into the database". Instead of inserting the metric data into the database, this should be generating JSON files and placing them in the ingestion bucket for the target environment. This allows us to test ingestion and validation, as well as avoids duplication of logic related to how metric data is added to the database. Appreciate this means a change in approach so do shout if there was a reason you went down this path instead of following the AC.

Stratum.objects.all().delete()

@classmethod
def _seed_time_series_rows(
Contributor


Can we add a docstring please?

@jrdh jrdh force-pushed the task/CDD-3154-seed-random-data branch 2 times, most recently from 4d2477c to c99d952 Compare March 9, 2026 13:54
@abdihakim92x1
Contributor Author

abdihakim92x1 commented Mar 10, 2026

There may be some other things I could pick up on in here but I'm going to pause going through it because one of the ACs is unmet and is a bigger problem - specifically "The data is generated and uploaded to the s3 bucket in the target dev env, not directly inserted into the database". Instead of inserting the metric data into the database, this should be generating JSON files and placing them in the ingestion bucket for the target environment. This allows us to test ingestion and validation, as well as avoids duplication of logic related to how metric data is added to the database. Appreciate this means a change in approach so do shout if there was a reason you went down this path instead of following the AC.

Thanks, agree this AC calls for generating payloads into the ingestion S3 bucket rather than writing directly to metrics tables.
The current implementation intentionally focused on quickly unblocking realistic local/dev data setup and validating model-level seeding behavior first, which is why it writes directly to DB.
I agree this does not satisfy the ingestion-path AC as written. I propose a follow-up change to pivot this command to produce JSON files and upload to the target env ingestion bucket, so we exercise ingestion + validation end-to-end and avoid duplicating DB write logic.
If helpful, I can raise that as a separate PR to keep scope/review risk controlled.
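
A rough sketch of what that pivot could look like, generating deterministic ingestion-style JSON payload files locally; the payload shape, file naming, and the S3 upload step (mentioned only in the docstring) are all placeholders, not the real ingestion schema:

```python
import json
import random
from pathlib import Path

def write_seed_payloads(seed: int, out_dir: Path, n_files: int = 3) -> list[Path]:
    """Write deterministic JSON payload files that could later be uploaded
    to the target environment's ingestion bucket (e.g. via boto3) instead
    of inserting rows directly into the database.

    The payload structure below is illustrative only, not the real
    ingestion schema.
    """
    rng = random.Random(seed)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_files):
        payload = {
            "topic": f"synthetic_topic_{i}",
            "time_series": [rng.randint(0, 100) for _ in range(5)],
        }
        path = out_dir / f"seed_payload_{i}.json"
        path.write_text(json.dumps(payload))
        paths.append(path)
    return paths
```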

@abdihakim92x1 abdihakim92x1 requested a review from jrdh March 10, 2026 08:59
Contributor

@jrdh jrdh left a comment


As well as the few comments throughout, it'd also be good to add a system test in the same vein as the build_cms_site system tests - i.e. use the actual command with the database and confirm that we get what we expect out of it using the API / querying the database.

Otherwise looks decent so far, though I haven't had a chance to test it in my aws env yet. I'll reply to the comment about the AC in a mo.

truncate_first: bool,
progress_callback: Callable[[str], None] | None = None,
) -> dict[str, int]:
"""Seed supporting metric models and time series rows for the selected scale."""
Contributor


Can we add param docs please?

@classmethod
def _seed_theme_hierarchy(cls) -> tuple[list[Theme], list[SubTheme], list[Topic]]:
theme_names, sub_theme_rows, topic_rows = cls._build_theme_hierarchy_records()
themes = cls._bulk_create(Theme, [Theme(name=name) for name in theme_names])
Contributor


When testing locally, if I run uhd bootstrap all and then run this seeding command without any truncation (e.g. python manage.py seed_random --dataset metrics), I get integrity errors. Try seed 1773741316 and you should get the same. Essentially, the hierarchy created here needs to either take the existing hierarchy into account when it's generated, or the bulk create of the records in the db needs to tolerate and manage the integrity errors.

Here's the stack trace:

((.venv) ) ➜  data-dashboard-api git:(task/CDD-3154-seed-random-data) python manage.py seed_random --dataset metrics                 
Seed used: 1773741316
Seeding metrics dataset...
Preparing metric taxonomy and geography records...
Traceback (most recent call last):
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/sqlite3/base.py", line 360, in execute
    return super().execute(query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.IntegrityError: UNIQUE constraint failed: data_theme.name

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/josh/work/data-dashboard-api/manage.py", line 23, in <module>
    main()
  File "/home/josh/work/data-dashboard-api/manage.py", line 19, in main
    execute_from_command_line(sys.argv)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/base.py", line 420, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/base.py", line 464, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 96, in handle
    counts = self._seed_metrics_data(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 133, in _seed_metrics_data
    themes, sub_themes, topics = cls._seed_theme_hierarchy()
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 259, in _seed_theme_hierarchy
    themes = cls._bulk_create(Theme, [Theme(name=name) for name in theme_names])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 376, in _bulk_create
    return model.objects.bulk_create(list(records))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 825, in bulk_create
    returned_columns = self._batched_insert(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1901, in _batched_insert
    self._insert(
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1873, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1882, in execute_sql
    cursor.execute(sql, params)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 122, in execute
    return super().execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 79, in execute
    return self._execute_with_wrappers(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 92, in _execute_with_wrappers
    return executor(sql, params, many, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 100, in _execute
    with self.db.wrap_database_errors:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/sqlite3/base.py", line 360, in execute
    return super().execute(query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.IntegrityError: UNIQUE constraint failed: data_theme.name
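
One way to make the bulk insert tolerant of existing rows is Django's `bulk_create(records, ignore_conflicts=True)`, which on SQLite compiles down to `INSERT OR IGNORE`. A standalone sqlite3 sketch of that behavior (the table and column names mirror the constraint in the trace above, but the data is made up):

```python
import sqlite3

# Mimic the data_theme table's UNIQUE constraint on name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_theme (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("INSERT INTO data_theme (name) VALUES ('infectious_disease')")

# A plain INSERT of a duplicate name raises IntegrityError, as in the
# stack trace above; INSERT OR IGNORE skips conflicting rows instead,
# which is what bulk_create(..., ignore_conflicts=True) emits on SQLite.
conn.executemany(
    "INSERT OR IGNORE INTO data_theme (name) VALUES (?)",
    [("infectious_disease",), ("extreme_event",)],
)

names = [row[0] for row in conn.execute("SELECT name FROM data_theme ORDER BY name")]
# Only the new row was added; the duplicate was silently skipped.
```

One caveat worth noting: `bulk_create` with `ignore_conflicts=True` does not set primary keys on the returned objects, so if later seeding steps rely on those instances, re-querying the rows (or using get_or_create per record) may be the safer route.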

geography=geography,
stratum=stratum,
age=age,
sex=None,
Contributor


Could we randomise this between the currently allowed values (male, female, all)?
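
A minimal sketch of the suggested randomisation, assuming the command's seeded RNG is threaded through so the choice stays deterministic per --seed. The string values below are taken from the comment; the real column may store different codes:

```python
import random

# Placeholder labels from the review comment; the actual stored values
# in the sex column may differ (e.g. single-letter codes).
ALLOWED_SEX_VALUES = ("male", "female", "all")

def pick_sex(rng: random.Random) -> str:
    """Pick one of the currently allowed sex values using the seeded RNG,
    so the same --seed always reproduces the same sequence of choices."""
    return rng.choice(ALLOWED_SEX_VALUES)

rng = random.Random(1234)
values = {pick_sex(rng) for _ in range(50)}
# All picks come from the allowed set, and rerunning with the same seed
# reproduces the exact same sequence.
```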

"small": {"geographies": 5, "metrics": 10, "days": 30},
"medium": {"geographies": 20, "metrics": 50, "days": 180},
"large": {"geographies": 100, "metrics": 200, "days": 365},
}
Contributor


Would be nice to add a comment here to give an indication of the scale of these. From testing myself, small gets you about 1500 time series records, medium gets you 180000, and large gets you 7.3 million.
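
For context, those counts line up with geographies × metrics × days, so the comment could be derived straight from the config. A sketch (the dict values are copied from the snippet; the `SCALE_CONFIG` and `estimated_records` names are illustrative, and real counts are approximate since the command may add other dimensions):

```python
# Scale presets annotated with the approximate number of time series
# records each produces (geographies x metrics x days).
SCALE_CONFIG = {
    "small": {"geographies": 5, "metrics": 10, "days": 30},      # ~1,500 records
    "medium": {"geographies": 20, "metrics": 50, "days": 180},   # ~180,000 records
    "large": {"geographies": 100, "metrics": 200, "days": 365},  # ~7,300,000 records
}

def estimated_records(scale: str) -> int:
    """Rough record count for a scale preset."""
    cfg = SCALE_CONFIG[scale]
    return cfg["geographies"] * cfg["metrics"] * cfg["days"]
```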

@jrdh
Contributor

jrdh commented Mar 19, 2026

There may be some other things I could pick up on in here but I'm going to pause going through it because one of the ACs is unmet and is a bigger problem - specifically "The data is generated and uploaded to the s3 bucket in the target dev env, not directly inserted into the database". Instead of inserting the metric data into the database, this should be generating JSON files and placing them in the ingestion bucket for the target environment. This allows us to test ingestion and validation, as well as avoids duplication of logic related to how metric data is added to the database. Appreciate this means a change in approach so do shout if there was a reason you went down this path instead of following the AC.

Thanks, agree this AC calls for generating payloads into the ingestion S3 bucket rather than writing directly to metrics tables. The current implementation intentionally focused on quickly unblocking realistic local/dev data setup and validating model-level seeding behavior first, which is why it writes directly to DB. I agree this does not satisfy the ingestion-path AC as written. I propose a follow-up change to pivot this command to produce JSON files and upload to the target env ingestion bucket, so we exercise ingestion + validation end-to-end and avoid duplicating DB write logic. If helpful, I can raise that as a separate PR to keep scope/review risk controlled.

I'd rather the original brief was fulfilled than some work merged and another ticket created to do what the original work should have done. Given none of this functionality exists yet and there's no urgent need for this that I'm aware of, the only downside I can see is that there will be chunks of code in this PR we no longer need. From your investigations to get to this point, did you find anything that would make using the S3 bucket ingestion pathway problematic to implement?

@jrdh jrdh force-pushed the task/CDD-3154-seed-random-data branch from b274ea0 to 3c728b1 Compare March 23, 2026 08:00
@abdihakim92x1 abdihakim92x1 force-pushed the task/CDD-3154-seed-random-data branch from 6a4ee5a to 75cc893 Compare March 27, 2026 12:39
@sonarqubecloud

sonarqubecloud bot commented Apr 1, 2026

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

